Abstract

Background: All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a 
number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under 
physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive 
Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences. 

Results: The average accuracy based on leave-one-out (LOO) cross validation of a Bayesian classifier generated
from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.16% for a holdout
test set comprising 103 amyloidogenic (AM) and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases
to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification
accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are
predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier,
depending on the training set size. The accuracy for the holdout test set is 89.28%. For the decision tree, the
non-amyloidogenic prediction accuracy is 75.00%.

Conclusions: This exploratory study indicates that both classification methods may be promising in providing
straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available
sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training
set size. Increasing the size of the training set clearly increases the accuracy, and the expansion of the training set
to include not only more derivatives, but also more alignments, would make the method more sound. The accuracy of
the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are
considered. The development of this type of classifier has significant applications in evaluating engineered
antibodies, and may be adapted for evaluating engineered proteins in general.



Background 

Antibodies are used in a number of therapeutic proce- 
dures such as target-specific anti-cancer therapy, immu- 
nosuppression, and purging prior to bone marrow 
transplants. Most of those antibodies are of nonhuman 
origin, and their administration often results in the gen- 
eration of adverse immune responses, which also limit 
their efficacy [1]. Humanization is usually performed to 
lessen the occurrence of these responses, to improve cir- 
culation half-life, and to restore effector functions [1,2]. 
Current humanization strategies include the retention of
variable domains or the specificity-determining residues
(SDR) only, grafting of complementarity-determining
regions (CDR), and veneering [3-6].

Humanization, however, may decrease the thermal sta- 
bility of an antibody and result in affinity reduction, as 
well as amyloid fibril formation, especially when the 
substitutions leave the humanized antibody prone to 
unfolding [3,7,8]. Studies indicate that the potential to 
form fibrils is a general property of polypeptide chains, 
but the propensity for amyloidosis is largely influenced 
by its sequence and the stability of its native state 
[9-11]. Furthermore, there is evidence that some anti- 
body sequences, notably kappa light chain sequences, 
become prone to fibril formation due to point mutations 
acquired during affinity maturation [12]. Apart from 
these, events that lead to misfolding, such as conformational
transitions between alpha helices and beta sheets,
and partial or complete unfolding, could lead to amyloidosis
[13-15]. Consequently, it would be of interest to
develop a method to predict such events, as well as to 
identify mutations that could lead to amyloidosis. Cur- 
rently, a number of computational methods are available 
for amyloidogenic potential prediction [16-18]. These 
generally use either the physicochemical properties of 
amino acids to create models for predicting aggregation 
rate on mutation and identifying hotspots, or the infor- 
mation from overlapping amyloidogenic polypeptide 
decomposition [17]. Recently, a method using mean 
packing density profiling has also been reported, and 
has been found to be able to predict both amyloidogenic 
and intrinsically disordered regions in both peptides and 
proteins [19]. Nevertheless, these methods yield predic- 
tions on which regions of a sequence are potentially 
amyloidogenic; for highly similar sequences, as is the case
with both amyloidogenic and non-amyloidogenic antibodies,
results from such methods are difficult to
distinguish (see Supplementary Information, additional
file 1). In this paper, we explore the use of naive Baye- 
sian and decision tree classification methods for predict- 
ing the amyloidogenic propensities of antibody 
sequences, with the primary application of predicting 
amyloidogenic propensities of engineered antibodies in 
mind. The naive Bayesian method provides the advan- 
tage of taking the effects of mutations at specific combi- 
nations of positions into account. The decision tree, on 
the other hand, intuitively allows the evaluation of more 
factors that may contribute to the amyloidogenic poten- 
tial. For generating the classifiers in both methods, 143 
amyloidogenic antibody sequences derived from twelve 
different germlines and 158 corresponding non-amyloi- 
dogenic derivatives were used. The unambiguous assign- 
ment of amyloidogenic and non-amyloidogenic 
sequences to their respective germlines is a critical pre- 
mise in this paper. Germlines are DNA elements that 
define the basic, inherited antibody repertoire of an indi- 
vidual, which are rearranged and mutated during the 
response to foreign antigens [20]. As indicated pre- 
viously, some sequences become prone to fibril forma- 
tion after this mutation process [12]; consequently, the 
generation of separate alignments for the amyloidogenic 
and non-amyloidogenic derivatives of a single germline 
might lead to the identification of mutation patterns or 
characteristics exclusively associated with amyloidosis. It 
is critical that sequences are assigned correctly to a 
germline in order to ensure that the mutations observed 
are actual mutations, and do not arise from incorrect 
alignments. All alignments used in this paper are hand- 
annotated. 

To test the classifiers and to evaluate the effects of the
training set size, a holdout test set consisting of an additional
103 amyloidogenic sequences and 28 non-amyloidogenic
sequences for eight of the twelve germlines
was used. The naive Bayesian method, which is
solely based on positional information, yields a predic- 
tion accuracy of 60.84% for amyloid-formers after LOO 
cross-validation, which is consistent with the 61.16% 
accuracy for the holdout test set. When the latter is 
included in the training set, LOO cross-validation accu- 
racy increases to 81.08%. Sequences classified using a 
decision tree, on the other hand, yielded an average pre- 
diction accuracy of 78.64% for the holdout test set. 

Results 

A direct implementation of the Naive Bayesian method
results in prediction accuracies between 60.84% and
81.08%

LOO cross-validation was performed to evaluate the 
accuracy of the Bayesian classifier; this particular 
method was used to allow the calibration data to be 
reused as test samples while simulating the prediction of 
future unknowns [21]. The average accuracy from this
validation was 60.84 ± 35.96% for classifying amyloidogenic
sequences, with 25.95% of the non-amyloidogenic
sequences being misclassified (Table 1, AMC and
NAMC). Validation performed on the holdout test set
yielded an average accuracy of 61.16 ± 13.75%, which
is consistent with the LOO cross-validation result (Table 1,
AM Test).

To evaluate the effects of training set size, the holdout 
test set was combined with the original training set to 
generate a new set of classifiers. These were again sub- 
jected to LOO cross-validation, yielding a higher average 
accuracy of 81.08 ± 29.33% (Table 1, AMC, new). 
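
The leave-one-out procedure used above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the names `loo_accuracy`, `train_majority` and `predict_majority` are hypothetical, and the majority-vote model only stands in for the Bayesian classifier so that the loop can be run on its own.

```python
from collections import Counter
from typing import Callable, List, Sequence


def loo_accuracy(samples: Sequence, labels: Sequence,
                 train: Callable, predict: Callable) -> float:
    """Fraction of samples predicted correctly when each is held out in turn."""
    if len(samples) != len(labels) or len(samples) < 2:
        raise ValueError("need at least two samples and one label per sample")
    samples, labels = list(samples), list(labels)
    correct = 0
    for i in range(len(samples)):
        # Train on every sample except the i-th one.
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        # Test on the single held-out sample.
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in model (assumption, for illustration only): predict the majority label.
def train_majority(train_samples: List, train_labels: List):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


toy_samples = ["seq1", "seq2", "seq3", "seq4"]  # placeholder identifiers
toy_labels = ["AM", "AM", "AM", "NAM"]
print(loo_accuracy(toy_samples, toy_labels, train_majority, predict_majority))
```

Because each sample is scored by a model that never saw it, the calibration data can be reused as test data, which is the property cited for choosing LOO here.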

Germline-specific decision trees result in an average 
prediction accuracy of 78% 

In order to construct a decision tree, we analyzed the
nature of the mutations exclusively associated with amyloid
formers using an algorithm and accompanying
visualization program that we have previously developed
[22,23]. The results indicate that mutations occurring
exclusively in the CDR or FR residues of amyloidogenic
derivatives are most likely the biggest contributors to
misfolding: 69% of the mutations in exposed CDRs
increase the sheet-forming propensity, as opposed to
36% in buried FRs (Figures 1 and 2; Table 2). In contrast,
the complements (31% for exposed CDRs and 64% for
buried FRs) decrease the sheet-forming propensity. We
used this information as branch weights for an initial
decision tree (Table 3); before establishing the weight
thresholds for classification, however, we checked whether
the paths taken by amyloidogenic and non-amyloidogenic
derivatives can be generalized. Interestingly, we
found no consensus paths for either amyloidogenic or

Table 1 Naive Bayes classifier accuracy

Germline      AMC^a         NAMC          AMC, new^b    NAMC, new     AM Test^c     NAM Test
              C      A      C      A      C      A      C      A      C      A      C      A
J00248        5      8      13     15     N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.
M30446        0      6      7      10     N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.
X72813        0      6      18     19     1      8      19     19     1      2      N.A.   N.A.
X93620        12     22     12     16     31     33     15     15     9      11     N.A.   N.A.
X93627        6      12     14     14     17     19     13     14     4      7      N.A.   N.A.
X93632        0      5      8      9      N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.
X93640        6      11     10     13     9      17     12     13     4      6      N.A.   N.A.
Z22188        11     15     10     12     29     34     9      12     13     19     N.A.   N.A.
Z22191        0      5      9      9      N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.   N.A.
Z22197        7      8      0      6      14     26     10     17     12     18     11     11
Z22208        7      13     12     14     31     35     13     20     8      22     4      4
Z73673        26     32     4      21     49     50     25     34     12     18     10     13
Accuracy (%)  60.84 ± 35.96 74.05 ± 31.49 81.08 ± 29.32 77.24 ± 13.04 61.16 ± 13.75 89.28 ± 13.32

^a Classifiers generated with the original training set comprised of 143 amyloidogenic and 158 non-amyloidogenic sequences
^b Classifiers generated with the original training set and the holdout test set
^c Results using AMC/NAMC, i.e. old classifiers
C = Correct; A = Actual




Figure 1 Normalized mutation matrices of amyloidogenic (Column A) and non-amyloidogenic derivatives (Column B) of 12 antibody
germlines. Original residues are in rows and corresponding replacement residues are in columns. The amino acids have been arranged
according to increasing β-sheet forming propensities [54]. The intensity matrix of the difference between the amyloidogenic and non-amyloidogenic
matrices (Column C) reflects the relative predominance of a mutation type in either amyloid or non-amyloid formers. A fourth
matrix set (Column D) is used to indicate the mutations that occur exclusively in amyloidogenic derivatives. Separate matrices were generated
for mutations in buried CDR, exposed CDR, buried FR and exposed FR positions.

Figure 2 Analysis of mutations exclusive to amyloidogenic derivatives. A rough analysis of mutation patterns could be made by dividing
the matrix using the diagonal, or by dividing it into quadrants. Mutations to the right of the diagonal are characterized by increased
sheet-forming propensities (+), while those to the left imply the opposite (-). In terms of the quadrants, which are numbered in the same way as the
Cartesian plane, the first contains information on mutations from low- to mid-propensity, sheet-associated amino acids to relatively
high-propensity sheet-associated amino acids (++), while the third quadrant contains the opposite (--). In the most general sense, mutations either on
the right of the diagonal, or in the first and third quadrants (shaded), would be the biggest contributors to destabilization. The analysis indicates
that a significant number of mutations in the exposed CDR residues result in increased β-sheet-forming propensities, while mutations in buried
FR residues tend to be associated with a decrease in β-sheet-forming propensities.



Table 2 Summary of mutations exclusive to amyloid formers

Exposure, Region   Increased β-sheet-forming propensity   Decreased β-sheet-forming propensity
Exposed CDR        20                                     9
Exposed FR         20                                     19
Buried CDR         21                                     16
Buried FR          12                                     21


non-amyloidogenic sequences; instead, consensus paths
appear to exist for each germline (Figure 3A, Table 4).
Consequently, we constructed a second decision tree
that takes the germline of origin into account, as was the
case in the Bayesian analysis. Depending on the germline,
weights along selected paths are either boosted or
decreased (Figure 3B, Table 4). Thresholds for separation
were chosen to maximally distinguish samples in the training
set (Table 5), and were evaluated using the holdout test
set. Table 6 lists the classification results per germline.
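
One way to picture the threshold selection is the sketch below, which scans candidate cut-offs and keeps the one that classifies the most training samples correctly. It is only an illustration under assumptions: the text does not state how the thresholds in Table 5 were searched, the function name `best_threshold` is hypothetical, the scores are invented, and the convention that a score at or above the threshold means "amyloidogenic" is assumed.

```python
from typing import List, Tuple


def best_threshold(am_scores: List[float], nam_scores: List[float]) -> Tuple[float, float]:
    """Return (threshold, training accuracy) for the best cut-off.

    Assumed convention: score >= threshold is called amyloidogenic.
    Ties are resolved in favour of the lowest threshold.
    """
    values = sorted(set(am_scores) | set(nam_scores))
    # Candidate cut-offs: below all scores, between neighbours, above all scores.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)

    total = len(am_scores) + len(nam_scores)
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum(s >= t for s in am_scores) + sum(s < t for s in nam_scores)
        accuracy = correct / total
        if accuracy > best_acc:
            best_t, best_acc = t, accuracy
    return best_t, best_acc


# Invented training scores for one germline (not data from this study).
am_train = [2.1, 1.9, 2.4, 1.2]
nam_train = [0.8, 1.1, 1.5, 0.6]
threshold, accuracy = best_threshold(am_train, nam_train)
print(threshold, accuracy)
```

A threshold chosen this way is tuned to the training set, which is why the text evaluates it separately on the holdout test set.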



Discussion 

The diversity of the antibody repertoire is generated
through the combinatorial recombination of a small
pool of germline genes and their somatic hypermutation.
Nevertheless, these diversification processes have drawbacks,
including the generation of autoreactive antibodies
as well as structurally compromised antibodies [24].
The latter are implicated in diseases that range from 
benign, high-level soluble light-chain production to 
pathological deposition in glomerular basal membrane 
cells, bone marrow plasma cells, interstitial tissues, 
arterial walls and basement membranes [24,25]. These 
unwanted effects often result from a set of mutations 
whose consequences on the structure are not so evident, 
so much so that the resulting unstable light chains 
evade elimination during posttranslational quality con- 
trol [24,26]. Avoiding such mutations or combinations 
thereof is critical in antibody engineering. 

From studies carried out on amyloidogenic antibodies, 
some patterns that can be linked to amyloidosis have 

Table 3 Decision tree weights

Edge                 Weight   Reference for weight
CDR                  1.0      Ratio of CDR:FR mutations
FR                   0.79
CDR - exposed        0.78     Ratio of buried:exposed CDR mutations
CDR - buried         1.0
FR - exposed         1.0      Ratio of buried:exposed FR mutations
FR - buried          0.85
CDR - exposed - ∧    0.69     Ratio of mutations increasing (∧) sheet-forming propensities to mutations
CDR - exposed - ∨    0.31     decreasing (∨) sheet-forming propensities in exposed CDR residues
CDR - buried - ∧     1.00     Ratio of mutations increasing (∧) sheet-forming propensities to mutations
CDR - buried - ∨     0.76     decreasing (∨) sheet-forming propensities in buried CDR residues
FR - exposed - ∧     1.00     Ratio of mutations increasing (∧) sheet-forming propensities to mutations
FR - exposed - ∨     0.95     decreasing (∨) sheet-forming propensities in exposed FR residues
FR - buried - ∧      0.74     Ratio of mutations increasing (∧) sheet-forming propensities to mutations
FR - buried - ∨      0.43     decreasing (∨) sheet-forming propensities in buried FR residues




been found. Poshusta and co-workers, for instance, have
reported that non-conservative mutations account for
0.6-0.79 of the total mutations in Vκ sequences, and for
0.4-0.59 of the mutations in Vλ sequences
[27]. They also reported differences in the location of
these mutations in patients with different secreted levels
of light chains. Specifically, it is implied that the position
of mutations, and not the amount secreted, plays a more
important role in light chain amyloidogenic propensity,
based on studies on patients with very low light chain
levels but advanced amyloid deposition [27]. Consequently,
it is clear that at least two factors
have to be considered in generating a protocol for predicting
amyloid formation: the combination of positions
at which mutations occur, and how these mutations
affect the structural stability of the antibody.

A review by Caflisch [17] classified the computational 
approaches used in predicting protein and peptide 
aggregation propensity into two general groups. The 
first makes use of the physicochemical properties of the
amino acids to create phenomenological models for predicting
aggregation behavior on mutation. The second,
on the other hand, uses the decomposition of amyloidogenic
peptides into overlapping segments. These are
then simulated at the atomic level to obtain estimates
of aggregation propensity, as well as the structural
details of the aggregates. Some programs that have since
been developed to deal with amyloidosis include the 
PASTA server [28,29], a fibril prediction program [30], 
AGGRESCAN [16], Zyggregator [31], and Pafig [32], 
among others. Nevertheless, these algorithms deal with 
the prediction of the segments involved or possibly 
involved in amyloidosis, but do not generate direct pre- 
dictions on whether a given sequence will be amyloido- 
genic or not. Here, we propose methods that may be 
used to complement existing prediction protocols in 
obtaining direct predictions about the amyloidogenicity 
of an antibody sequence; the method may be extended 
to other protein types, provided that there are suffi- 
ciently related positive and negative training sets. 

A Naive Bayesian classifier uses probabilities to link
hypotheses to events defined by a set of attributes. In
Mitchell [33], the Naive Bayesian classifier $v_{NB}$ is
defined as:

$$v_{NB} = \underset{v_j \in V}{\arg\max}\; P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j) \tag{1}$$

where $v_j$ is one of the classes in the set $V$ and $a_i$ is one of the $n$
attributes describing an event.

This approach is attractive for the current problem,
where there are only two possible outcomes. The most
straightforward way of applying it is to use information
on the combinations of positions at which mutations
occur in amyloidogenic and non-amyloidogenic derivatives
of a single germline. For example, to gauge the
probability that a test sequence x derived from a germline
g will be amyloidogenic, one would use the Bayes
equation to evaluate the association between the positional
combination of mutations, c, in x and the two
hypotheses:

$$p(x \text{ is AM}) = p_{AM} \times p(x_{m_1} \mid AM) \times p(x_{m_2} \mid AM) \times \dots \times p(x_{m_n} \mid AM) \tag{2}$$

$$p(x \text{ is NAM}) = p_{NAM} \times p(x_{m_1} \mid NAM) \times p(x_{m_2} \mid NAM) \times \dots \times p(x_{m_n} \mid NAM) \tag{3}$$

where $x_{m_1}, x_{m_2}, \dots, x_{m_n}$ define c, and with $p_{AM}$ and
$p_{NAM}$ being defined by the positional mutational probabilities
in amyloidogenic and non-amyloidogenic derivatives,
respectively. Applying this method (Methods
section, equations 4 and 5; Figure 4) yielded an average
prediction accuracy of 60.84%; for an independent test
set, the accuracy was 61.16% (Table 1). When the test
set is used for training as well, the accuracy of amyloid
sequence classification increases to 81.08%. Misclassification
of non-amyloidogenic sequences is also reduced

[Figure 3 image. Legend: amyloidogenic derivatives; non-amyloidogenic derivatives; germline (1..n); path with boosted score; path with decreased score.]


Figure 3 Decision tree for the evaluation of individual mutations. A decision tree (A) was constructed in order to evaluate the contribution 
of a mutation to amyloidogenicity. A path is followed for each mutation, depending on its position and exposure, as well as on the increase or 
decrease in sheet-forming propensity associated with it. Each path leads to one of eight terminal nodes, which is associated with a score, 
defined as the product of the weights (in italics) along the path leading to it. An analysis of paths taken by amyloidogenic and non- 
amyloidogenic derivatives of the different germlines indicated that different pairs of terminal nodes may be used to provide maximum 
separation between these derivatives. For instance, amyloidogenic derivatives of X93627 mostly end in leaf 1, while the non-amyloidogenic 
counterparts are more frequently associated with leaf 7; germline derivatives that can be distinguished using specific terminal nodes are 
indicated in the illustration. Based on this analysis, a final tree (B) was created which branches first on the basis of the germline to which the 
derivative being tested belongs; the structure and weights of the original tree (A) are kept. Each edge emanating from a germline node is 
connected to a copy of the original tree, where weights on paths which could be used for maximizing the separation between amyloidogenic 
and non-amyloidogenic derivatives are either boosted or decreased tenfold. For the illustrative example in (B), paths for J00248 (Germline 1) and 
Z22208 (Germline n) are shown. 
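
The path scoring described in the caption can be sketched as follows, using the edge weights listed in Table 3. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the tenfold boost or decrease is applied here as a single factor on the whole path score, and summing the per-mutation scores into a sequence score is an assumption, since this section does not state how the scores of several mutations are combined.

```python
from typing import Dict, List, Optional, Tuple

# Edge weights transcribed from Table 3 ("up" = increased, "down" = decreased
# sheet-forming propensity).
REGION_WEIGHT = {"CDR": 1.0, "FR": 0.79}
EXPOSURE_WEIGHT = {
    ("CDR", "exposed"): 0.78, ("CDR", "buried"): 1.0,
    ("FR", "exposed"): 1.0, ("FR", "buried"): 0.85,
}
DIRECTION_WEIGHT = {
    ("CDR", "exposed", "up"): 0.69, ("CDR", "exposed", "down"): 0.31,
    ("CDR", "buried", "up"): 1.00, ("CDR", "buried", "down"): 0.76,
    ("FR", "exposed", "up"): 1.00, ("FR", "exposed", "down"): 0.95,
    ("FR", "buried", "up"): 0.74, ("FR", "buried", "down"): 0.43,
}

Path = Tuple[str, str, str]  # (region, exposure, direction)


def leaf_score(region: str, exposure: str, direction: str,
               path_factor: float = 1.0) -> float:
    """Product of the three edge weights along one path, times an optional factor."""
    score = (REGION_WEIGHT[region]
             * EXPOSURE_WEIGHT[(region, exposure)]
             * DIRECTION_WEIGHT[(region, exposure, direction)])
    # path_factor = 10.0 boosts a path, 0.1 decreases it (assumed placement).
    return score * path_factor


def sequence_score(mutations: List[Path],
                   path_factors: Optional[Dict[Path, float]] = None) -> float:
    """Sum of leaf scores over all mutations (assumed aggregation rule)."""
    path_factors = path_factors or {}
    return sum(leaf_score(*m, path_factor=path_factors.get(m, 1.0)) for m in mutations)


example = [("CDR", "exposed", "up"), ("FR", "buried", "down")]
print(sequence_score(example))
print(sequence_score(example, {("CDR", "exposed", "up"): 10.0}))
```

Keeping the germline-specific factors in a separate mapping mirrors the structure of tree (B), where each germline reuses the weights of tree (A) and only selected paths are rescaled.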

Table 4 Summary of leaves providing maximum separation between amyloidogenic and non-amyloidogenic
derivatives of different germline sets*

Leaf  J00248   M30446   X72813   X93620   X93627   X93632   X93640   Z22188   Z22191   Z22197   Z22208   Z73673
1     0.091    0.009    0.024    -0.016   0.042    0.046    -0.001   0.044    0.028    0.036    -0.032   -0.036
2     -0.030   0.008    0.009    -0.013   -0.?35   -0.093   0.022    -0.075   0.073    0.089    0.052    0.062
3     -0.038   -0.001   0.071    -0.035   -0.038   -0.209   0.100    -0.085   -0.035   -0.017   0.068    0.003
4     -0.058   0.030    -0.145   -0.017   0.053    0.116    -0.008   -0.123   0.058    -0.198   0.039    0.014
5     -0.044   0.007    0.056    0.065    0.018    0.070    -0.009   -0.081   -0.092   -0.057   -0.025   0.008
6     0.132    -0.028   0.043    0.004    0.026    0.070    0.012    0.079    0.058    0.026    -0.006   -0.029
7     -0.058   -0.031   -0.054   -0.052   -0.048   0.000    -0.018   0.102    -0.082   0.158    -0.105   -0.016
8     0.007    0.006    0.066    0.063    0.083    0.00     -0.099   0.139    -0.011   -0.037   0.009    0.040

* Values were obtained by subtracting the percentage of mutations in non-amyloidogenic derivatives from the percentage of mutations in amyloidogenic
derivatives terminating in a given leaf. Minimum and maximum values per germline set, which were used to identify the paths where scores were decreased and
boosted, respectively, are shown in italics and boldface, respectively.
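
The selection rule in the footnote (maximum difference marks the boosted path, minimum difference marks the decreased path) can be sketched as below for two germline columns of Table 4. The function name `separating_leaves` is hypothetical, and ties are resolved in favour of the lower leaf number, which is an assumption.

```python
from typing import Dict, List, Tuple

# Per-leaf differences (leaves 1 to 8) transcribed from Table 4.
LEAF_DIFFERENCES: Dict[str, List[float]] = {
    "Z22197": [0.036, 0.089, -0.017, -0.198, -0.057, 0.026, 0.158, -0.037],
    "Z22208": [-0.032, 0.052, 0.068, 0.039, -0.025, -0.006, -0.105, 0.009],
}


def separating_leaves(differences: List[float]) -> Tuple[int, int]:
    """Return (boosted_leaf, decreased_leaf) as 1-based leaf numbers."""
    indices = range(len(differences))
    boosted = max(indices, key=lambda i: differences[i]) + 1    # largest difference
    decreased = min(indices, key=lambda i: differences[i]) + 1  # smallest difference
    return boosted, decreased


for germline, diffs in LEAF_DIFFERENCES.items():
    print(germline, separating_leaves(diffs))
```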



Table 5 Summary of thresholds

Germline   Threshold
J00248     1.70
M30446     1.50
X72813     1.75
X93620     0.65
X93627     0.85
X93640     2.50
Z22188     0.80
Z22191     0.75
Z22197     0.65
Z22208     1.50
Z73673     0.75





Table 6 Decision tree classification accuracy*

Germline               AM              NAM
J00248                 N.A.   N.A.     N.A.   N.A.
M30446                 N.A.   N.A.     N.A.   N.A.
X72813                 1      2        N.A.   N.A.
X93620                 9      11       N.A.   N.A.
X93627                 ?      ?        N.A.   N.A.
X93632                 N.A.   N.A.     N.A.   N.A.
X93640                 3      6        N.A.   N.A.
Z22188                 14     19       N.A.   N.A.
Z22191                 N.A.   N.A.     N.A.   N.A.
Z22197                 13     18       9      11
Z22208                 19     22       3      4
Z73673                 15     18       9      13
Average accuracy (%)   78.64 ± 17.44   78.64 ± 6.30

* N.A. indicates that no additional sequences were obtained for this germline.



by an average of 3% (Table 1, NAM Test). This correlation
between the size of the training set and prediction
accuracy has been previously observed [34]. It is
worth noting that the prediction accuracy for
derivatives of the germline X72813 did not improve
significantly even after the augmentation of the data set.
Predictions for this germline are similarly low with the
decision tree. Interestingly, most of the derivatives of
X72813 are implicated in light chain deposition disease
(LCDD). An interesting feature of LCDD-associated
sequences is that when these are synthesized in vitro,
the resulting proteins do not aggregate. Furthermore,
the analysis of these sequences frequently shows no
obvious predisposition towards misfolding [35]. This
may be a possible explanation for the difficulty in
obtaining correct predictions for its amyloid-forming
derivatives. If this set is treated as an outlier, the average
prediction accuracy is 83.64 ± 18.49%.

In general, however, it is imperative to increase the
training set size - not only in terms of the number of
derivatives per germline, but also in terms of the number
of germlines covered - in order to improve the performance
of the classifier. The development of a program
for automatically generating training sets is a non-trivial
task, however, and is beyond the scope of this
study. It could also be possible to consider other characteristics,
such as the physico-chemical and structural
effects of a mutation, as factors for defining $p_{AM}$ or
$p_{NAM}$. Nevertheless, the question of how such factors
would be incorporated in the calculation has to be justified
first, from both statistical and biological points
of view. Since our main interest is to provide a
proof-of-concept that a simple set of classification algorithms
may be used for predicting amyloidosis, we opted to
complement the Bayesian method with a decision tree,
where one could factor in additional effects of mutations
for classifying sequences.

[Figure 4 panel]

              Amyloid formers   Non-amyloid formers
Germline      SQSVLYSSNNKNY     SQSVLYSSNNKNY
Derivative 1  SQNLLDSSFDTNT     SQSVLFSSNNKNY
Derivative 2  SQSVLYSSNSKNY     SQSVLYSSNNKNY
Derivative 3  SRSVLSDSNSRNL     SQSVLYDSNNKNY
Derivative 4  SQSVLYSSNNKNY     SQSILDSSNNRNY
Derivative 5  SLSVFFSPNNKNY     SQSVLYTTKNKNH
Derivative 6  SQSVLYNSNNKNF     SQTVLYSSNNKNY
Derivative 7  SQSVLLSFKNDNY     SQSILYSSNNKNY

Mutation propensity
Position   Amyloid formers   Non-amyloid formers
1          0/7               0/7
2          2/7               0/7
3          1/7               1/7
4          1/7               2/7
5          1/7               0/7
6          4/7               2/7
7          2/7               2/7
8          2/7               1/7
9          2/7               1/7
10         3/7               0/7
11         3/7               1/7
12         0/7               0/7
13         3/7               1/7

Test sequence: SQSVLYSSNNKNY

Amyloidogenic probability:
8/9 × 6/9 × 7/9 × 7/9 × 7/9 × 4/9 × 6/9 × 6/9 × 6/9 × 5/9 × 4/9 × 8/9 × 5/9
(for example, (3 + 1)/(7 + 2) for a mutated position and ((7 - 3) + 1)/(7 + 2) for an unmutated position)

Non-amyloidogenic probability:
8/9 × 8/9 × 7/9 × 6/9 × 8/9 × 6/9 × 6/9 × 7/9 × 7/9 × 8/9 × 2/9 × 8/9 × 7/9
(for example, (1 + 1)/(7 + 2) for a mutated position and ((7 - 1) + 1)/(7 + 2) for an unmutated position)

Prediction: non-amyloidogenic

Figure 4 Application of the naive Bayesian method for the prediction of amyloidosis. Given a set of amyloidogenic and non-amyloidogenic
derivatives of a single germline, it is possible to generate the probability that a mutation at a particular position would cause
amyloidosis or not. Briefly, separate mutation propensities for amyloid ($p_{AM}$) and non-amyloid ($p_{NAM}$) formers are generated by counting the
frequency of mutations per position. These fractions, as well as complements thereof (i.e. the probability that there will be no mutation in either
an amyloid-former or non-amyloid-former at a particular position, in black) are subsequently used to compute the amyloidogenic and
non-amyloidogenic probabilities of a test sequence. To calculate the amyloidogenic probability of a test sequence, a probability is assigned to
each of the n positions in the sequence based on the characteristic of that position (i.e. if it contains a mutation or not). For positions
containing no mutations this probability is equivalent to $q_{AM}$, where $q_{AM} = 1 - p_{AM}$ for position x. The probability for positions with mutations is equal
to $p_{AM}$. Non-amyloidogenic probabilities are calculated in a similar manner, but with the use of $p_{NAM}$ instead of $p_{AM}$. To avoid multiplications by
zero, the Laplace correction is used. A product of the probabilities is subsequently taken; if the product of amyloidogenic probabilities is higher,
the test sequence is classified as amyloidogenic.
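
The calculation summarized in this caption can be reproduced with the short Python sketch below, which uses the germline and derivative sequences shown in Figure 4. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical; class priors are omitted because both example sets contain seven derivatives; and the test sequence carries an assumed K-to-R substitution at position 11, since the figure's terms (4/9 and 2/9 at that position) imply a mutation there but the substituted residue is not legible in the extracted figure. Only the presence of a mutation enters the calculation, not the residue itself.

```python
from fractions import Fraction
from typing import List

GERMLINE = "SQSVLYSSNNKNY"
AMYLOID = ["SQNLLDSSFDTNT", "SQSVLYSSNSKNY", "SRSVLSDSNSRNL", "SQSVLYSSNNKNY",
           "SLSVFFSPNNKNY", "SQSVLYNSNNKNF", "SQSVLLSFKNDNY"]
NON_AMYLOID = ["SQSVLFSSNNKNY", "SQSVLYSSNNKNY", "SQSVLYDSNNKNY", "SQSILDSSNNRNY",
               "SQSVLYTTKNKNH", "SQTVLYSSNNKNY", "SQSILYSSNNKNY"]
TEST = "SQSVLYSSNNRNY"  # hypothetical: germline with K -> R at position 11


def mutation_counts(germline: str, derivatives: List[str]) -> List[int]:
    """Number of derivatives that differ from the germline at each position."""
    return [sum(d[i] != g for d in derivatives) for i, g in enumerate(germline)]


def class_score(germline: str, derivatives: List[str], test: str) -> Fraction:
    """Product over positions of Laplace-corrected probabilities."""
    n = len(derivatives)
    score = Fraction(1)
    for g, t, k in zip(germline, test, mutation_counts(germline, derivatives)):
        # Mutated position: use the mutation count k; otherwise its complement n - k.
        hits = k if t != g else n - k
        score *= Fraction(hits + 1, n + 2)  # Laplace correction avoids zero factors
    return score


def classify(test: str) -> str:
    am = class_score(GERMLINE, AMYLOID, test)
    nam = class_score(GERMLINE, NON_AMYLOID, test)
    return "amyloidogenic" if am > nam else "non-amyloidogenic"


print(classify(TEST))
```

Exact fractions are used so that the two products can be compared without floating-point rounding; for long sequences, summing logarithms would be the usual alternative.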

Decision trees are particularly useful in classifying
unknowns into one of a finite number of categories,
based on the results of a series of tests on the attributes
of a sample [36,37]. They work by posing a series
of questions about the features associated with
unknowns; each question is contained in a node, and
each node has child nodes for each possible answer to
its question [38,39]. A tree eventually terminates in leaves,
which correspond to classifications. There are many
variants of decision trees; in the simplest form, 'yes'/'no'
paths are followed throughout the classification
process; in others, probability distributions over the
classes are used in order to estimate the conditional
probability that an item reaching a leaf belongs to the
class it defines [39]. In biology, decision trees have been used in
Parkinson's disease management [40], disease severity
profiling [41,42], toxicity analysis [43], large-scale proteomic
studies [44,45], microarray data classification
[46] and phylogenetic analysis, among other applications.
Depending on the number of factors that will be
considered to classify the samples, decision trees may
be made by hand or constructed automatically using a
learning or an optimization algorithm [38,47]. Choosing
these factors and their arrangement on the tree to
optimally separate samples remains a challenge in the
creation of decision trees; algorithms have since been
developed for optimal tree creation [36-38]. For this
study, four splitting variables were considered, based
on the mutation trends observed in both amyloidogenic
and non-amyloidogenic samples.

In order to obtain weights for the splitting variables, 
mutation matrices were generated for the amyloiodo- 
genic and non-amyloidogenic derivatives of the differ- 
ent germlines. An interesting result from the analysis 
of these matrices is that 69% of the mutations exclu- 
sively found in exposed CDR residues of amyloid for- 
mers appear to be implicated in higher sheet-forming 
propensities, while 64% exclusive to buried FR residues 
involve shifts to residues with lower sheet-forming 
propensities (Figures 1 and 2, Table 2). This may sug- 
gest that mutations stabilizing sheet structures in the 
CDR, which normally assume loop structures, contri- 
bute as much to amyloidosis as those that destabilize 
the sheet structure in critical regions (i.e. buried FR 
residues). This is not unlikely, based on some previous 
observations. Hurle et al. [48], for instance, performed 
a positional analysis of 36 amyloidogenic sequences to 
find mutations that occur in less than 1% of all 
sequences at a particular position. These mutations 
were mostly found in CDRs, notably CDR1, for both κ
and λ light chains. Furthermore, Stevens et al.
observed that 24 out of the 26 invariant residues in κ
light chains which drastically affect the structure of the 
antibody upon mutation are found on the protein
surface, and make no obvious contributions to folding.
Mutations in CDRs are generally more varied, and their
contributions to amyloidosis, though not as easy to
pinpoint, are probably very significant [49]. Finally, 
these results are consistent with predictions using 
other methods (see supplementary information, addi- 
tional file 1); this consistency may be viewed as a vali- 
dation of our observations. 
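
The direction of a mutation's shift in sheet-forming propensity, which underlies the percentages quoted above, can be illustrated with a short Python sketch built on the values of Table 7. The one-letter residue keys, the function names and the use of -3.0 for Pro (reported only as < -3) are assumptions made for this example; the sketch is not the procedure used to produce Figures 1 and 2.

```python
# Sheet-forming propensities (kcal/mol) as listed in Table 7 [54].
# Assumption: Pro is reported only as "< -3"; -3.0 is used as a stand-in bound.
SHEET_PROPENSITY = {
    "T": 1.1, "I": 1.0, "Y": 0.96, "F": 0.86, "V": 0.82,
    "M": 0.72, "S": 0.70, "W": 0.54, "C": 0.52, "L": 0.51,
    "R": 0.45, "K": 0.27, "Q": 0.23, "E": 0.01, "A": 0.00,
    "H": -0.02, "N": -0.08, "D": -0.94, "G": -1.2, "P": -3.0,
}


def propensity_shift(original, replacement):
    """Return 'higher', 'lower' or 'same' for an original -> replacement mutation."""
    delta = SHEET_PROPENSITY[replacement] - SHEET_PROPENSITY[original]
    if delta > 0:
        return "higher"
    if delta < 0:
        return "lower"
    return "same"


def fraction_higher(mutations):
    """Fraction of (original, replacement) pairs that raise sheet-forming propensity."""
    if not mutations:
        return 0.0
    raised = sum(1 for o, r in mutations if propensity_shift(o, r) == "higher")
    return raised / len(mutations)
```

Applied to the mutations exclusive to one residue class (for example, exposed CDR residues of amyloid formers), `fraction_higher` gives the kind of proportion reported in the text.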

From these observations, a decision tree was created 
to approximate the contribution of each mutation to the 
overall amyloidogenicity of a sequence. The use of this 
tree on the independent test set yielded a prediction 
accuracy of 78.64% (Table 6), which is close to the 75% 
prediction accuracy obtained when the decision tree is 
tested on training set sequences. LOO cross validation 
was not performed for this method, since this would 
require weights to be changed as many times as there 
are sequences. Classifiers generated with the training set 
appear to have a better performance than those from 
the naive Bayesian method. One possible reason is
that more factors are taken into consideration - one 
approximates the effect of the mutation itself, as well as 
the effect that it has in being at a particular region; at 
the same time, it also roughly approximates the com- 
bined effect of mutations, which are likely to be equally 
responsible for misfolding as individual mutations 
[27,50]. Nevertheless, this does not imply that the naive 
Bayesian method is entirely without merit, since it is 
clear that the positions or combinations of positions where
mutations occur have a key role in amyloidosis [27]. It is
also evident that more sequences have to be used, as 
with the naive Bayesian method. Prediction results will
probably also be improved by including additional factors
such as hydrophilicity, size and charge changes as
splitting variables, or refining the positions based on 
precedent studies [27]. In adding splitting variables, the 
construction of a decision tree could be performed 
using an automated optimization algorithm [38].

A caveat for both methods, however, is the possibi- 
lity of overfitting, which is the description of random 
error, instead of true correlations. This phenomenon is 
one of the key problems in machine learning, and may 
occur when there are more degrees of freedom than 
data [51,52]. Overfitted model results are not represen- 
tative of the population behavior, and are unlikely to 
be replicated. There are several rules of thumb for
avoiding overfitting, which include having a minimum
of 10 - 15 observations per predictor variable, with lar- 
ger sample sizes required in cases where the effect 
sizes are small, or when predictors are highly corre- 
lated [52]. For binary response models, the sample size 
may not be directly relevant [52], although for this 
problem, it appears that sample size plays an impor- 
tant role. Due to the limited sample set size, it was 






only possible to perform a single holdout validation 
and LOO cross validation, whose results were consis- 
tent. However, for future work involving larger training 
sets, it would be possible to include measures and per- 
form more definitive tests to ensure that overfitting is 
eliminated or minimized. 
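
For readers unfamiliar with the procedure, leave-one-out (LOO) cross validation can be sketched generically in Python as follows. The function name and the `train`/`predict` callables are hypothetical placeholders for any classifier; this is a minimal sketch and not the code used in this study.

```python
def loo_accuracy(samples, train, predict):
    """Leave-one-out accuracy.

    samples: list of (features, label) pairs.
    train:   callable taking a list of samples and returning a model.
    predict: callable taking (model, features) and returning a label.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    correct = 0
    for i, (features, label) in enumerate(samples):
        # Train on every sample except the i-th, then test on the i-th.
        training = samples[:i] + samples[i + 1:]
        model = train(training)
        if predict(model, features) == label:
            correct += 1
    return correct / len(samples)
```

Because the model is rebuilt once per held-out sequence, the cost grows with the number of sequences, which is why the procedure is practical only when retraining is cheap.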

Conclusions 

This exploratory study indicates that the Naive Baye- 
sian classifier and decision trees may be used for 
"yes"- or "no"-type predictions on the amyloidogenicity 
of a sequence. Analysis of results from both methods 
suggests that prediction accuracy may be improved by 
optimizing the training set sizes, and by incorporating 
more information about the alterations brought about 
by mutations into the calculations. Some other factors 
that may be considered include hydrophilicity and 
charge changes brought about by the replacement 
residues, with respect to their location, as well as the
way the mutations cluster from sequences with known 
structures. Another factor that might be considered is 
the sequence of immunoglobulin folding and the 
implications of having mutations in the N-terminal 
region, which is the first to be folded [53]. The further 
development of these classification techniques, includ- 
ing the possibility of creating a hybrid between Naive 
Bayesian and decision trees, appears to be worthwhile; 
these methods may eventually be adapted for predict- 
ing the amyloidogenicity of non-immunoglobulin 
sequences. 

Methods 

Sequences 

The training set, comprising 143 amyloidogenic and
158 non-amyloidogenic derivatives of the germlines,
was obtained from the National Center for Biotechnology
Information (NCBI, http://www.ncbi.nlm.nih.gov/).
A holdout test set comprising 103 amyloidogenic and
28 non-amyloidogenic sequences, chosen on account of
the absence of gaps, as well as the possibility of assigning
these unambiguously to a germline set, was also
obtained from the NCBI. Sequences were assigned to
the closest germline using ClustalW, and resulting align- 
ments were manually annotated. Kabat numbering and 
CDR/FR definitions were applied to all sequences. The 
non-amyloidogenic derivation sets were constructed 
from randomly chosen derivatives of each germline 
which have, as a derivation set, approximately the same 
total number of mutations as the amyloidogenic counterparts.
The first five amino acid residues were omitted
from the analysis, since these may have been primer-derived.
All sequences of the amyloidogenic and non-
amyloidogenic antibodies used in the analysis, which are 
identified by their NCBI accession codes, as well as their 



putative germline derivation, are in the supplementary 
information (additional file 2). 

Naive Bayesian Classification 

We generated a Naive Bayesian Classifier for each germline
on the basis of its amyloidogenic and non-amyloidogenic
derivatives. Briefly, the probability p of a
mutation occurring at position x was quantified for both
amyloidogenic (p_AM) and non-amyloidogenic (p_NAM)
derivatives of the same germline. Raw values of p_AM and
p_NAM can take the value of 0; to avoid this, we used the
Laplace correction method, where 1 is added to the
numerator and 2 to the denominator. The respective
complements, q_AM and q_NAM, which represent the retention
of the residue, are given by 1 - p_AM and 1 - p_NAM,
respectively. These probabilities are then used to calculate
the amyloidogenic and non-amyloidogenic propensities
for a test sequence s derived from the same
germline as the training set. Supposing that s has mutations
at positions defined by the set M, the amyloidogenic
probability AM will be calculated as:

$$AM = \prod_{x \in M} p_{AM}(x) \prod_{x \notin M} q_{AM}(x) \qquad (4)$$

while the non-amyloidogenic probability is calculated
as:

$$NAM = \prod_{x \in M} p_{NAM}(x) \prod_{x \notin M} q_{NAM}(x) \qquad (5)$$

where x refers to the position (Figure 4). If AM is
greater than NAM, then the sequence is classified as
amyloidogenic; otherwise, it is classified as non-amyloidogenic.
Classifier accuracy was checked against
both the training and test sets. Due to the
limited number of sequences obtained, validation is preliminary,
and consists of a LOO cross-validation, performed
for all amyloidogenic and non-amyloidogenic
derivatives, and a one-time holdout test validation.
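
The classification rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each derivative is represented as a set of mutated positions, the function names are hypothetical, and logarithms of the products are compared in order to avoid numerical underflow (which leaves the comparison of AM and NAM unchanged).

```python
import math


def mutation_probabilities(derivatives, positions):
    """Laplace-corrected probability of a mutation at each position.

    derivatives: list of sets of mutated positions, one set per sequence.
    positions:   iterable of all positions considered for the germline.
    """
    n = len(derivatives)
    return {
        x: (sum(1 for d in derivatives if x in d) + 1) / (n + 2)
        for x in positions
    }


def log_propensity(probs, mutated):
    """Sum of log p(x) over mutated positions and log (1 - p(x)) over the rest."""
    total = 0.0
    for x, p in probs.items():
        total += math.log(p if x in mutated else 1.0 - p)
    return total


def classify(p_am, p_nam, mutated):
    """Label a test sequence from its set of mutated positions."""
    if log_propensity(p_am, mutated) > log_propensity(p_nam, mutated):
        return "amyloidogenic"
    return "non-amyloidogenic"
```

With the Laplace correction every probability lies strictly between 0 and 1, so both logarithms are always defined, including for positions never mutated in the training derivatives.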

Decision tree generation and sequence classification 

A weighted decision tree was constructed to provide a 
quantitative estimate of both individual and joint contributions
of mutations as functions of location (i.e. CDR/FR),
exposure and changes in sheet-forming propensity.
The steps for generating the tree are shown in Figure 5. 
Initially, separate mutation matrices for buried CDR 
residues, buried FR residues, exposed CDR residues, and 
exposed FR residues are generated for alignments of 
amyloidogenic and non-amyloidogenic derivatives, based 
on the algorithm described in [22]. Here, exposed residues
were defined as residues having > 25% accessible
surface; exposure information was generated for each
alignment using structural homologues of the germline
sequence (see supplementary information, additional file
2).

Figure 5 Steps in generating and testing a weighted decision tree. To create a weighted decision tree, mutations from amyloidogenic and
non-amyloidogenic derivatives of a single germline are organized into separate matrices that take location, exposure and sheet-forming
propensity into account (Step 1). These matrices are visualized and analyzed for general trends that may be transformed into weights (Step 2).
An initial tree is constructed from this information, which is tested against the training set (Step 3). From this testing, it became evident that
certain paths can be used for maximally separating amyloidogenic and non-amyloidogenic derivatives of a germline, and that these paths are
germline-dependent. We then generated a tree that takes the germline of origin into account, and which has different boosted paths. The final
step was to generate the classification threshold, which was determined from the analysis of scores for the test set (Step 4). This tree was then
used to classify sequences in an independent, holdout test set (Step 5).

The matrices were then visualized to facilitate analysis,
then post-processed by subtracting the non-amyloidogenic
from the amyloidogenic matrix image, resulting
in an image where the relative intensities are propor- 
tional to the predominance of specific mutations. A bin- 
ary matrix containing mutations exclusive to amyloid- 
formers was also generated. In the matrices, residues 
were arranged according to increasing β-sheet-forming
propensities (Table 7) [54], with the original residues in 
the rows and the replacement residues in the columns, 
such that all mutations to the right of the diagonal are 
associated with increased sheet-forming propensities, 
while those to the left correspond to decreased sheet- 
forming propensities (Figure 2; Figure 5, step 1). The 
trends observed in these matrices (Figures 1, 2 and 5, 
step 2; Table 2) were then used as weights, which were 
associated with the branches of the tree. At this point, 
we determined if paths taken by amyloid and non-amy- 
loid-formers could be generalized, or if these showed 
germline dependence. This led to the identification of 
paths that may be used in maximizing separation 
between amyloidogenic and non-amyloidogenic deriva- 
tives per germline (Table 4; Figure 5, step 3); for 



Table 7 β-sheet forming propensities of amino acids [54]

Amino acid    ΔΔG (kcal mol⁻¹)
Thr            1.1
Ile            1.0
Tyr            0.96
Phe            0.86
Val            0.82
Met            0.72
Ser            0.70
Trp            0.54
Cys            0.52
Leu            0.51
Arg            0.45
Lys            0.27
Gln            0.23
Glu            0.01
Ala            0.00
His           -0.02
Asn           -0.08
Asp           -0.94
Gly           -1.2
Pro           <-3



* N.A. indicates that no additional sequences were obtained for this germline.
instance, amyloidogenic derivatives of X93627 can be
maximally separated from corresponding non-amyloido- 
genic derivatives by giving a tenfold higher score to 
mutations that follow the path leading to leaf 2 and a 
tenfold lower score for mutations leading to leaf 8. 
Boosted and decreased paths to specific leaves are indi- 
cated in Table 4 in boldface and italics, respectively. 
Consequently, tracing the path through the tree that 
describes each mutation yields a score, s, calculated as 
the product of the weights along the path. Using this 
strategy, the average amyloidogenic potential for every 
sequence, AM_seq, was calculated as follows:



$$AM_{seq} = \frac{\sum_{m=1}^{n} s_m}{n} \qquad (6)$$



where s corresponds to scores of individual mutations, 
and n corresponds to the number of mutations in a 
sequence. Since s is amplified in certain paths, amyloidogenic
sequences are expected to have higher AM_seq
values. Thresholds for classifying sequences as amyloi- 
dogenic or non-amyloidogenic were defined per germ- 
line based on the average scores of amyloidogenic 
derivatives (Figure 5, step 4). Cross-validation was per- 
formed on the holdout test set (Figure 5, step 5). 
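
The scoring scheme can be sketched in Python as follows. This is an illustrative sketch only: each mutation is described by three of the splitting variables named above (location, exposure and direction of the sheet-propensity change), and all numerical weights, the boost table and the use of "greater than or equal to" at the threshold are hypothetical assumptions, not the values or conventions of this study.

```python
from math import prod

# Hypothetical branch weights; the study derived its weights from mutation matrices.
BRANCH_WEIGHTS = {
    "CDR": 0.6, "FR": 0.4,
    "exposed": 0.5, "buried": 0.5,
    "higher": 0.7, "lower": 0.3,
}


def path_score(mutation, boosts=None):
    """Score s of one mutation: product of the weights along its path.

    mutation: tuple (location, exposure, propensity_shift).
    boosts:   optional dict mapping a path to a germline-specific multiplier,
              e.g. 10.0 for a boosted path or 0.1 for a decreased path.
    """
    s = prod(BRANCH_WEIGHTS[branch] for branch in mutation)
    if boosts:
        s *= boosts.get(mutation, 1.0)
    return s


def am_seq(mutations, boosts=None):
    """Average score over the n mutations of a sequence (0.0 if unmutated)."""
    if not mutations:
        return 0.0
    return sum(path_score(m, boosts) for m in mutations) / len(mutations)


def classify(mutations, threshold, boosts=None):
    """Compare the average score with a per-germline threshold (assumed inclusive)."""
    if am_seq(mutations, boosts) >= threshold:
        return "amyloidogenic"
    return "non-amyloidogenic"
```

Keeping the germline-specific boosts in a separate table means that the same base tree can be reused across germlines, with only the multipliers and the threshold changing.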



Additional file 1: Comparison of predictions between a germline
and an amyloidogenic derivative made using AGGRESCAN [16] and
the PASTA server [28,29]. This shows that regions that may cause
amyloidosis are predicted, with highly similar profiles. However, no direct
predictions are provided (i.e. that the germline is non-amyloidogenic,
and that the derivative is amyloidogenic) in these methods.
[http://www.biomedcentral.com/content/supplementary/1471-2105-11-79-S1.PDF]

Additional file 2: Amyloidogenic and non-amyloidogenic
immunoglobulin sequence alignments for each of the germline
derivation sets, including the exposure data. The structure indicated
at the end of each alignment refers to the structural template used as
the basis for determining residue exposure. Sequences in red are those
belonging to the holdout test set.
[http://www.biomedcentral.com/content/supplementary/1471-2105-11-79-S2.PDF]